Final IMDB Movie Data Set Project

Emma McLaren

12/9/19

Signature and K-State Honor Code

E. McLaren. "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."

1 Business Understanding

  • 1.1 Background
  • 1.2 Data Collection Strategy
  • 1.3 Business Problem

2 Data Understanding

  • 2.1 Data Description
  • 2.2 Data Quality
  • 2.3 Data Types
  • 2.4 Variable Selection

3 Data Preparation

  • 3.1 Remove duplicate records
  • 3.2 Resolve missing values
  • 3.3 Clean up column values
  • 3.4 Resolve remaining missing values

4 Data Visualization

  • 4.1 Number of movies released by year
  • 4.2 Top 10 performing movies by IMDB score and gross sales
  • 4.3 20 movies with the greatest losses
  • 4.4 Trends between content rating and IMDB score and gross sales
  • 4.5 Changes in IMDB score through time
  • 4.6 Density of rating scores by rating level and IMDB score
  • 4.7 Distribution of data between rating levels

5 Correlation Analysis

  • 5.1 Analysis of correlations in Main data set
  • 5.2 Analysis of correlations between genres and IMDB scores

6 Findings Summary

References

1. Business Understanding

Background

Understanding and predicting movie success remains a key challenge in the movie industry. Film budgets are extremely high but can yield great profits when managed well. According to researchers in the field, predicting movie success has never been easy. Previously, critic reviews or even release schedules were believed to play the greatest role, but these were assumptions that needed more evidence and testing. With the assistance of analytics-based approaches, however, seemingly overwhelming data sets can be turned into insights. As movie researchers and data analysts continue to dig deeper into these data sets, we can make more informed decisions and recommendations for professionals in the industry.

Data Collection Strategy

Using a data set of 5,043 movies listed on IMDB, this project will analyze trends among top-performing movies across multiple variables such as score, duration, and genre. It will also dig deeper into what makes these movies successful in order to support future predictions.

Business Problem

For the purposes of this project, a movie's success will be measured by its IMDB score.

2. Data Understanding

Import Packages

In this step, I import the packages needed to prepare the assignment.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

#scatter matrix
from pandas.plotting import scatter_matrix

#scipy package
from scipy.stats import mannwhitneyu  

#regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn import linear_model
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

#confusion matrix
import scikitplot as skplt

#classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV 

#clustering
from scipy.spatial.distance import cdist 
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward

Loading data

First I will load the data set into the report.

In [2]:
# load movie_metadata.csv
df = pd.read_csv('data/movie_metadata.csv')
df.head()
Out[2]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

Then I will check the general information of the data set, such as column statistics and data types, as well as the correlations between columns, to get a better understanding of the data.

In [3]:
# check statistics of data set
df.describe()
Out[3]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000
In [4]:
#check data types of data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

From the data types list above, I can identify a set of object columns that are not useful for further analysis. Though actors could be an interesting way to segment the movies and see which actors were popular among the top movies, the actors in a movie don't on their own determine its success. These columns include:

  • plot_keywords
  • movie_imdb_link
  • actor_1_name
  • actor_2_name
  • actor_3_name

From the correlation matrix below, I can identify a list of columns that will likely not be useful for further analysis of movie success due to their low correlation with IMDB score. These columns include:

  • actor_1_facebook_likes
  • actor_2_facebook_likes
  • actor_3_facebook_likes
  • aspect_ratio
  • facenumber_in_poster
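Before scanning the full matrix, every numeric column's correlation with the target can be ranked directly; a minimal sketch on a small hypothetical frame (column names borrowed from the data set, values invented for illustration):

```python
import pandas as pd

# toy stand-in for a few numeric columns of the movie data
toy = pd.DataFrame({
    'imdb_score':           [7.9, 7.1, 6.8, 8.5, 6.6],
    'num_voted_users':      [886204, 471220, 275868, 1144337, 212204],
    'facenumber_in_poster': [0, 2, 1, 0, 3],
})

# pull the target's column out of the correlation matrix and rank it
ranking = toy.corr()['imdb_score'].drop('imdb_score').sort_values(ascending=False)
print(ranking)
```

On the real data set the same one-liner against `df` would sort all sixteen numeric columns by their relevance to imdb_score in one step.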
In [5]:
#check correlation values for all variables with IMDB score to determine relevance for analysis
df.corr()
Out[5]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
num_critic_for_reviews 1.000000 0.258486 0.180674 0.271646 0.190016 0.480601 0.624943 0.263203 -0.033897 0.609387 0.119994 0.275707 0.282306 0.305303 -0.049786 0.683176
duration 0.258486 1.000000 0.173296 0.123558 0.088449 0.250298 0.314765 0.123074 0.013469 0.328403 0.074276 -0.135038 0.131673 0.261662 -0.090071 0.196605
director_facebook_likes 0.180674 0.173296 1.000000 0.120199 0.090723 0.144945 0.297057 0.119549 -0.041268 0.221890 0.021090 -0.063820 0.119601 0.170802 0.001642 0.162048
actor_3_facebook_likes 0.271646 0.123558 0.120199 1.000000 0.249927 0.308026 0.287239 0.473920 0.099368 0.230189 0.047451 0.096137 0.559662 0.052633 -0.003366 0.278844
actor_1_facebook_likes 0.190016 0.088449 0.090723 0.249927 1.000000 0.154468 0.192804 0.951661 0.072257 0.145461 0.022639 0.086873 0.390487 0.076099 -0.020049 0.135348
gross 0.480601 0.250298 0.144945 0.308026 0.154468 1.000000 0.637271 0.247400 -0.027755 0.559958 0.102179 0.030886 0.262768 0.198021 0.069346 0.378082
num_voted_users 0.624943 0.314765 0.297057 0.287239 0.192804 0.637271 1.000000 0.265911 -0.026998 0.798406 0.079621 0.007397 0.270790 0.410965 -0.014761 0.537924
cast_total_facebook_likes 0.263203 0.123074 0.119549 0.473920 0.951661 0.247400 0.265911 1.000000 0.091475 0.206923 0.036557 0.109971 0.628404 0.085787 -0.017885 0.209786
facenumber_in_poster -0.033897 0.013469 -0.041268 0.099368 0.072257 -0.027755 -0.026998 0.091475 1.000000 -0.069018 -0.019559 0.061504 0.071228 -0.062958 0.013713 0.008918
num_user_for_reviews 0.609387 0.328403 0.221890 0.230189 0.145461 0.559958 0.798406 0.206923 -0.069018 1.000000 0.084292 -0.003147 0.219496 0.292475 -0.024719 0.400594
budget 0.119994 0.074276 0.021090 0.047451 0.022639 0.102179 0.079621 0.036557 -0.019559 0.084292 1.000000 0.045726 0.044236 0.030688 0.006598 0.062039
title_year 0.275707 -0.135038 -0.063820 0.096137 0.086873 0.030886 0.007397 0.109971 0.061504 -0.003147 0.045726 1.000000 0.101890 -0.209167 0.159973 0.218678
actor_2_facebook_likes 0.282306 0.131673 0.119601 0.559662 0.390487 0.262768 0.270790 0.628404 0.071228 0.219496 0.044236 0.101890 1.000000 0.083808 -0.007783 0.243487
imdb_score 0.305303 0.261662 0.170802 0.052633 0.076099 0.198021 0.410965 0.085787 -0.062958 0.292475 0.030688 -0.209167 0.083808 1.000000 0.059445 0.247049
aspect_ratio -0.049786 -0.090071 0.001642 -0.003366 -0.020049 0.069346 -0.014761 -0.017885 0.013713 -0.024719 0.006598 0.159973 -0.007783 0.059445 1.000000 0.025737
movie_facebook_likes 0.683176 0.196605 0.162048 0.278844 0.135348 0.378082 0.537924 0.209786 0.008918 0.400594 0.062039 0.218678 0.243487 0.247049 0.025737 1.000000
In [6]:
#count missing values per column
df.isnull().sum().sort_values(ascending=False)
Out[6]:
gross                        884
budget                       492
aspect_ratio                 329
content_rating               303
plot_keywords                153
title_year                   108
director_name                104
director_facebook_likes      104
num_critic_for_reviews        50
actor_3_name                  23
actor_3_facebook_likes        23
num_user_for_reviews          21
color                         19
duration                      15
facenumber_in_poster          13
actor_2_name                  13
actor_2_facebook_likes        13
language                      12
actor_1_name                   7
actor_1_facebook_likes         7
country                        5
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64

Another data quality issue with this data set is missing values. In the code above, I checked for missing values and found a significant number missing from the gross and budget columns, which are highly relevant to predicting movie success. Since these values are so important to determining success, I don't want to fill them with averages or assumptions, as this could bias the regression I plan to run later.
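Expressing those counts as a share of all rows makes the severity clearer (gross alone is missing in roughly 18% of the 5,043 records); a sketch of the pattern on a toy frame:

```python
import numpy as np
import pandas as pd

# toy frame with holes, mirroring the gross/budget gaps
toy = pd.DataFrame({
    'gross':  [1.0, np.nan, 3.0, np.nan],
    'budget': [2.0, 2.5, np.nan, 4.0],
})

# isnull().mean() gives the fraction missing per column, worst first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)  # gross 0.50, budget 0.25
```

Run against `df`, `df.isnull().mean()` would turn the raw counts above into per-column percentages.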

Next Steps:

  • Remove any duplicate records
  • Remove entries with missing values in the budget and gross columns
  • Replace remaining missing values with average or most likely values
  • Remove irrelevant columns
  • Convert string variables to integers where relevant so they can be tested with regression

3. Data Preparation

Remove Duplicate Records

As identified in the previous section, in order to clean the data and prepare it for analysis, I will need to check for duplicate records to ensure these don't skew the results when analyzing.

According to the code I ran below, there are 45 records that duplicate another record. I then view the duplicated records side-by-side to verify the issue.

In [7]:
# count duplicate records, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
df.duplicated().sum()
Out[7]:
45
In [8]:
# examine duplicated rows, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
df.loc[df.duplicated(keep=False), :].sort_values('movie_title').head(10)
Out[8]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
4950 Color David Hewlett 8.0 88.0 686.0 405.0 David Hewlett 847.0 NaN Comedy ... 46.0 English Canada NaN 120000.0 2007.0 686.0 7.0 1.78 377
4949 Color David Hewlett 8.0 88.0 686.0 405.0 David Hewlett 847.0 NaN Comedy ... 46.0 English Canada NaN 120000.0 2007.0 686.0 7.0 1.78 377
4408 Color Yimou Zhang 101.0 95.0 611.0 3.0 Ni Yan 9.0 190666.0 Comedy|Drama ... 20.0 Mandarin China R NaN 2009.0 4.0 5.7 2.35 784
3007 Color Yimou Zhang 101.0 95.0 611.0 3.0 Ni Yan 9.0 190666.0 Comedy|Drama ... 20.0 Mandarin China R NaN 2009.0 4.0 5.7 2.35 784
2181 Color Jon Lucas 81.0 100.0 24.0 851.0 Jay Hernandez 15000.0 55461307.0 Comedy ... 46.0 English USA R 20000000.0 2016.0 1000.0 6.7 NaN 18000
2562 Color Jon Lucas 81.0 100.0 24.0 851.0 Jay Hernandez 15000.0 55461307.0 Comedy ... 46.0 English USA R 20000000.0 2016.0 1000.0 6.7 NaN 18000
2798 Color Shawn Levy 69.0 88.0 189.0 799.0 Donald Faison 934.0 47811275.0 Adventure|Comedy|Family ... 99.0 English USA PG 15000000.0 2002.0 927.0 5.4 1.85 896
2628 Color Shawn Levy 69.0 88.0 189.0 799.0 Donald Faison 934.0 47811275.0 Adventure|Comedy|Family ... 99.0 English USA PG 15000000.0 2002.0 927.0 5.4 1.85 896
4942 Color Paul Schrader 130.0 93.0 261.0 697.0 Ruby Dee 783.0 NaN Fantasy|Horror|Thriller ... 106.0 English USA R 18000000.0 1982.0 782.0 6.1 1.85 0
2902 Color Paul Schrader 130.0 93.0 261.0 697.0 Ruby Dee 783.0 NaN Fantasy|Horror|Thriller ... 106.0 English USA R 18000000.0 1982.0 782.0 6.1 1.85 0

10 rows × 28 columns

Since duplicates exist in the data set, I will remove the second instance of each duplicated record, keeping only one record per movie. After that, I verify that the 45 extra records have been removed.

In [9]:
# drops the duplicate rows, code sourced from: https://www.ritchieng.com/pandas-removing-duplicate-rows/
# update dataset to not include the duplicate records, "inplace=True" code sourced from https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas
df.drop_duplicates(keep='first', inplace=True)
In [10]:
#verify that the 45 duplicate rows were dropped
df.shape
Out[10]:
(4998, 28)
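As a further check, duplicates could also be flagged on movie_title alone, which would catch re-listings whose other fields differ slightly; a sketch with invented rows:

```python
import pandas as pd

# hypothetical rows: one exact title repeat
toy = pd.DataFrame({
    'movie_title': ['Avatar', 'Avatar', 'Spectre'],
    'gross':       [760505847.0, 760505847.0, 200074175.0],
})

# subset= restricts the duplicate test to the title column only
deduped = toy.drop_duplicates(subset='movie_title', keep='first')
print(len(deduped))  # 2
```

Whether title-only deduplication is appropriate depends on whether two rows with the same title can legitimately be different films (e.g. remakes), so the full-row deduplication used above is the more conservative choice.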

Resolve Missing Values

In the previous section, I identified that the gross and budget columns were missing the most values. Since these columns are critical to determining movie success, I won't replace their missing entries with averages; instead, I will drop those rows to preserve the validity of the data.

In [11]:
#count missing values per column
df.isnull().sum().sort_values(ascending=False)
Out[11]:
gross                        874
budget                       487
aspect_ratio                 327
content_rating               301
plot_keywords                152
title_year                   107
director_name                103
director_facebook_likes      103
num_critic_for_reviews        49
actor_3_name                  23
actor_3_facebook_likes        23
num_user_for_reviews          21
color                         19
duration                      15
facenumber_in_poster          13
actor_2_name                  13
actor_2_facebook_likes        13
language                      12
actor_1_name                   7
actor_1_facebook_likes         7
country                        5
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64
In [12]:
#handling missing values: remove the rows with any missing value in the budget or gross column
df = df.dropna(subset=['gross', 'budget'])
df.isnull().sum().sort_values(ascending=False)
df.shape
Out[12]:
(3857, 28)

Though this noticeably reduces the number of data entries I have, I still retain a large portion of the data and can be certain that it won't be skewed by imputing assumptions.
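The retention can be quantified from the two shape outputs above (4998 rows before the drop, 3857 after):

```python
# row counts taken from the df.shape outputs above
before, after = 4998, 3857
retained = after / before
print(f"{retained:.1%}")  # 77.2% of rows survive the drop
```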

Remove irrelevant columns

In this section, I will remove the columns I identified in the previous section as not relevant to determining the success of a movie, along with the genres column, which is analyzed separately in section 5.2.

In [13]:
# remove irrelevant columns
df_new = df.drop(['actor_3_facebook_likes','actor_1_facebook_likes', 'facenumber_in_poster', 'actor_2_facebook_likes', 'aspect_ratio', 'plot_keywords', 'movie_imdb_link','actor_2_name', 'actor_1_name', 'actor_3_name', 'genres'], axis=1)
df_new.head()
Out[13]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 English USA PG-13 237000000.0 2009.0 7.9 33000
1 Color Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 English USA PG-13 300000000.0 2007.0 7.1 0
2 Color Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 English UK PG-13 245000000.0 2015.0 6.8 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 English USA PG-13 250000000.0 2012.0 8.5 164000
5 Color Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 English USA PG-13 263700000.0 2012.0 6.6 24000

Clean up column values

To make the data easier to test with a correlation analysis, I will convert the remaining object columns into integer codes:

  • I will code the color column to 0 and 1, denoting black-and-white and color movies respectively
  • I will code the language column to 0 and 1, denoting other languages and English respectively
  • I will code the country column to 0 and 1, denoting other countries and USA respectively
  • I will encode content_rating from 1 to 5, representing ratings from G to NC-17 respectively
In [14]:
#Replace ' Black and White' (note the leading space in the source data) with 0 and 'Color' with 1
df_new = df_new.replace({'color': ' Black and White'}, {'color': 0})
df_new = df_new.replace({'color': 'Color'}, {'color': 1})
In [15]:
#verify color names have been changed
df_new['color'].value_counts()
Out[15]:
1.0    3725
0.0     130
Name: color, dtype: int64
In [16]:
#get language values to inform categorization
df_new.groupby('language')['movie_title'].count().sort_values(ascending=False)
Out[16]:
language
English       3674
French          37
Spanish         26
Mandarin        14
German          13
Japanese        12
Hindi           10
Cantonese        8
Italian          7
Korean           5
Portuguese       5
Norwegian        4
Hebrew           3
Persian          3
Dutch            3
Danish           3
Thai             3
Dari             2
Indonesian       2
Aboriginal       2
Icelandic        1
Hungarian        1
Arabic           1
Aramaic          1
Bosnian          1
Telugu           1
Czech            1
Swedish          1
Russian          1
Romanian         1
Dzongkha         1
None             1
Filipino         1
Mongolian        1
Maya             1
Kazakh           1
Vietnamese       1
Zulu             1
Name: movie_title, dtype: int64
In [17]:
#Replace current values of other languages and English with 0 and 1 respectively
df_new = df_new.replace({'language': 'English'}, {'language': 1})
df_new = df_new.replace({'language': ['French','Spanish','Mandarin','German','Japanese','Hindi','Cantonese','Italian','Korean','Portuguese','Norwegian','Hebrew','Persian','Dutch','Danish','Thai','Dari','Indonesian','Aboriginal','Icelandic','Hungarian','Arabic','Aramaic','Bosnian','Telugu','Czech','Swedish','Russian','Romanian','Dzongkha','None','Filipino','Mongolian','Maya','Kazakh','Vietnamese','Zulu']}, {'language': 0})
df_new.head()
Out[17]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 USA PG-13 237000000.0 2009.0 7.9 33000
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 USA PG-13 300000000.0 2007.0 7.1 0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 UK PG-13 245000000.0 2015.0 6.8 85000
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 USA PG-13 250000000.0 2012.0 8.5 164000
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 USA PG-13 263700000.0 2012.0 6.6 24000
In [18]:
#verify language names have been changed and grouped accurately
df_new.groupby('language')['movie_title'].count().sort_values(ascending=False)
Out[18]:
language
1.0    3674
0.0     180
Name: movie_title, dtype: int64
In [19]:
#get country values to inform categorization
df_new.groupby('country')['movie_title'].count().sort_values(ascending=False)
Out[19]:
country
USA               3047
UK                 320
France             106
Germany             81
Canada              64
Australia           41
Spain               22
Japan               17
China               14
India               14
Hong Kong           13
New Zealand         11
Italy               11
Mexico              11
South Korea          9
Denmark              9
Ireland              7
Brazil               5
Norway               4
Iran                 4
Thailand             4
Argentina            3
South Africa         3
Netherlands          3
Russia               3
Israel               3
Czech Republic       3
Romania              2
Iceland              2
Hungary              2
Taiwan               2
Belgium              2
Chile                1
Aruba                1
Colombia             1
West Germany         1
Finland              1
Georgia              1
Greece               1
Indonesia            1
New Line             1
Official site        1
Peru                 1
Philippines          1
Poland               1
Sweden               1
Afghanistan          1
Name: movie_title, dtype: int64
In [20]:
#Replace current values of other countries and USA with 0 and 1 respectively
df_new = df_new.replace({'country': 'USA'}, {'country': 1})
df_new = df_new.replace({'country': ['UK','France','Germany','Canada','Australia','Spain','Japan','China','India','Hong Kong','New Zealand','Italy','Mexico','South Korea','Denmark','Ireland','Brazil','Norway','Iran','Thailand','Argentina','South Africa','Netherlands','Russia','Israel','Czech Republic','Romania','Iceland','Hungary','Taiwan','Belgium','Chile','Aruba','Colombia','West Germany','Finland','Georgia','Greece','Indonesia','New Line','Official site','Peru','Philippines','Poland','Sweden','Afghanistan']}, {'country': 0})
df_new.head()
Out[20]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 PG-13 237000000.0 2009.0 7.9 33000
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 PG-13 300000000.0 2007.0 7.1 0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 PG-13 245000000.0 2015.0 6.8 85000
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 PG-13 250000000.0 2012.0 8.5 164000
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 PG-13 263700000.0 2012.0 6.6 24000
In [21]:
#verify country names have been changed and grouped accurately
df_new.groupby('country')['movie_title'].count().sort_values(ascending=False)
Out[21]:
country
1    3047
0     810
Name: movie_title, dtype: int64
In [22]:
#get content rating values to inform categorization and renaming
df_new.groupby('content_rating')['movie_title'].count().sort_values(ascending=False)
Out[22]:
content_rating
R            1723
PG-13        1314
PG            573
G              91
Not Rated      42
Unrated        24
Approved       17
X              10
NC-17           6
Passed          3
M               2
GP              1
Name: movie_title, dtype: int64
In [23]:
#Replace current values of Not Rated, Passed, Approved, Unrated with R, based on historical changes in the rating system: https://www.filmratings.com/History
df_new = df_new.replace({'content_rating': ['Not Rated','Passed', 'Approved', 'Unrated']}, {'content_rating': 'R'})
df_new.head()
Out[23]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 PG-13 237000000.0 2009.0 7.9 33000
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 PG-13 300000000.0 2007.0 7.1 0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 PG-13 245000000.0 2015.0 6.8 85000
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 PG-13 250000000.0 2012.0 8.5 164000
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 PG-13 263700000.0 2012.0 6.6 24000
In [24]:
#Replace current values of X, M, and GP with NC-17, PG, and PG respectively based on changes in historical rating system: https://www.filmratings.com/History
df_new = df_new.replace({'content_rating': ['M','GP']}, {'content_rating': 'PG'})
df_new = df_new.replace({'content_rating': 'X'}, {'content_rating': 'NC-17'})
df_new.head()
Out[24]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 PG-13 237000000.0 2009.0 7.9 33000
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 PG-13 300000000.0 2007.0 7.1 0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 PG-13 245000000.0 2015.0 6.8 85000
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 PG-13 250000000.0 2012.0 8.5 164000
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 PG-13 263700000.0 2012.0 6.6 24000
In [25]:
#verify content rating names have been changed and grouped accurately
df_new.groupby('content_rating')['movie_title'].count().sort_values(ascending=False)
Out[25]:
content_rating
R        1809
PG-13    1314
PG        576
G          91
NC-17      16
Name: movie_title, dtype: int64
In [26]:
#Replace G, PG, PG-13, R, and NC-17 with 1 through 5 respectively for later analysis
df_new = df_new.replace({'content_rating': ['G']}, {'content_rating': 1})
df_new = df_new.replace({'content_rating': ['PG']}, {'content_rating': 2})
df_new = df_new.replace({'content_rating': ['PG-13']}, {'content_rating': 3})
df_new = df_new.replace({'content_rating': ['R']}, {'content_rating': 4})
df_new = df_new.replace({'content_rating': ['NC-17']}, {'content_rating': 5})
df_new.head()
Out[26]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 3.0 237000000.0 2009.0 7.9 33000
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 3.0 300000000.0 2007.0 7.1 0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 3.0 245000000.0 2015.0 6.8 85000
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 3.0 250000000.0 2012.0 8.5 164000
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 3.0 263700000.0 2012.0 6.6 24000
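The chained replace calls above each handle one value at a time; the same ordinal coding can be kept in a single dict and applied with map, a sketch:

```python
import pandas as pd

# hypothetical column holding the five surviving rating levels
toy = pd.DataFrame({'content_rating': ['G', 'PG', 'PG-13', 'R', 'NC-17']})

# one dict keeps the whole ordinal coding in a single place
rating_codes = {'G': 1, 'PG': 2, 'PG-13': 3, 'R': 4, 'NC-17': 5}
toy['content_rating'] = toy['content_rating'].map(rating_codes)
print(toy['content_rating'].tolist())  # [1, 2, 3, 4, 5]
```

One behavioral difference worth noting: `map` turns any value absent from the dict into NaN, whereas `replace` leaves it untouched, so `map` doubles as a check that every rating level was accounted for.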

Resolve remaining missing values

In this step, it is critical that the remaining missing values are either removed or filled. Here I will use imputation, replacing null values with each column's most common or most likely value. The columns still missing values are:

  • content_rating
  • language
  • color
  • num_critic_for_reviews
  • duration
In [27]:
#find remaining missing values
df_new.isnull().sum().sort_values(ascending=False)
Out[27]:
content_rating               51
language                      3
color                         2
num_critic_for_reviews        1
duration                      1
movie_title                   0
director_name                 0
director_facebook_likes       0
gross                         0
movie_facebook_likes          0
num_voted_users               0
imdb_score                    0
num_user_for_reviews          0
country                       0
budget                        0
title_year                    0
cast_total_facebook_likes     0
dtype: int64

To resolve the missing values for content_rating, I will change all null values to the most common rating, "R," which is denoted by 4 in the data set.

In [28]:
# changing content_rating null values to the most common rating, "R" (4)
#verify no more null values for content_rating
df_new = df_new.fillna({'content_rating': 4})
df_new.isnull().sum()
Out[28]:
color                        2
director_name                0
num_critic_for_reviews       1
duration                     1
director_facebook_likes      0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
num_user_for_reviews         0
language                     3
country                      0
content_rating               0
budget                       0
title_year                   0
imdb_score                   0
movie_facebook_likes         0
dtype: int64

To resolve the missing values for color, I will change all null values to the most likely value based on release year: 0 (Black and White) for movies released before 1939, and 1 (Color) otherwise.

In [29]:
# changing color null values with 1 if after 1939, 0 if before, code based on Canvas discussion board suggestion: https://k-state.instructure.com/courses/81006/discussion_topics/505217?module_item_id=1875123
#verify no more null values for color
df_new.loc[df_new['color'].isnull() & (df_new['title_year'] < 1939),'color'] = 0
df_new.loc[df_new['color'].isnull() & (df_new['title_year'] >= 1939),'color'] = 1
df_new.isnull().sum()
Out[29]:
color                        0
director_name                0
num_critic_for_reviews       1
duration                     1
director_facebook_likes      0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
num_user_for_reviews         0
language                     3
country                      0
content_rating               0
budget                       0
title_year                   0
imdb_score                   0
movie_facebook_likes         0
dtype: int64

To resolve the missing values for language, I will change all null values to their most common result, English, denoted by a 1 in the data set.

In [30]:
# changing language null values to the most common value, 1 (English)
#verify no more null values for language
df_new = df_new.fillna({'language': 1})
df_new.isnull().sum()
Out[30]:
color                        0
director_name                0
num_critic_for_reviews       1
duration                     1
director_facebook_likes      0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
imdb_score                   0
movie_facebook_likes         0
dtype: int64

To resolve the missing values for duration, I will change all null values to the average movie duration of the data set.

In [31]:
# changing duration null value with average duration
#verify no more null values for duration
df_new = df_new.fillna({'duration': df_new['duration'].mean()})
df_new.isnull().sum()
Out[31]:
color                        0
director_name                0
num_critic_for_reviews       1
duration                     0
director_facebook_likes      0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
imdb_score                   0
movie_facebook_likes         0
dtype: int64

To resolve the missing values for num_critic_for_reviews, I will change all null values to the average number of critic reviews of the data set.

In [32]:
# changing num_critic_for_reviews null value with average 
#verify no more null values for num_critic_for_reviews
df_new = df_new.fillna({'num_critic_for_reviews': df_new['num_critic_for_reviews'].mean()})
df_new.isnull().sum()
Out[32]:
color                        0
director_name                0
num_critic_for_reviews       0
duration                     0
director_facebook_likes      0
gross                        0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
imdb_score                   0
movie_facebook_likes         0
dtype: int64

Now that I have finished imputing values and verified above that no missing values remain, I will also check the length of the cleaned data set. Since 3857 is the expected number of records, I am done resolving missing values.

In [33]:
#verify length of cleaned data set
len(df_new)
Out[33]:
3857
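For reference, the separate imputation steps above can also be expressed as one fillna call taking a column-to-value dict; a sketch on toy data (column names mirror this data set, and the fill rules match the steps above):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'content_rating': [4.0, np.nan, 4.0, 3.0],
    'language':       [1.0, 1.0, np.nan, 1.0],
    'duration':       [100.0, np.nan, 120.0, 110.0],
})

# categorical columns get their most common value, numeric ones their mean
fill_values = {
    'content_rating': df_demo['content_rating'].mode()[0],
    'language':       df_demo['language'].mode()[0],
    'duration':       df_demo['duration'].mean(),
}
df_demo = df_demo.fillna(fill_values)
print(int(df_demo.isnull().sum().sum()))  # 0
```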

Separate genres

In a later step, I will want to see the genres' correlations with IMDB scores. I'll use dummy variables to separate the genres out and save them as df1 to be used later.

In [34]:
#now, separate a string of genres into dummy variables
# I borrowed the code below from Midterm discussion board: https://k-state.instructure.com/courses/81006/discussion_topics/505217?module_item_id=1875123
df1 = df.join(df.pop('genres').str.get_dummies('|'))
df1 = df1[['Action','Adventure','Animation','Biography','Comedy','Crime','Documentary','Drama','Family','Fantasy','Film-Noir','History','Horror','Music','Musical','Mystery','Romance','Sci-Fi','Short','Sport','Thriller','War','Western', 'imdb_score']] 
df1.head()
Out[34]:
Action Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy ... Musical Mystery Romance Sci-Fi Short Sport Thriller War Western imdb_score
0 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 7.9
1 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 7.1
2 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 6.8
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 8.5
5 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 6.6

5 rows × 24 columns

To ensure I completed this step properly, I check the length of the df1 data set to verify that it is the expected length. Since 3857 is the correct number of records, I have correctly separated the genres into dummy variables without adding extra rows.

In [35]:
len(df1)
Out[35]:
3857
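The str.get_dummies('|') call used above splits each pipe-delimited genre string into one 0/1 indicator column per genre; a minimal sketch of the mechanics on toy data:

```python
import pandas as pd

genres = pd.Series(['Action|Adventure|Sci-Fi', 'Drama', 'Comedy|Drama'])

# one indicator column per distinct genre token, columns sorted alphabetically
dummies = genres.str.get_dummies('|')
print(list(dummies.columns))      # ['Action', 'Adventure', 'Comedy', 'Drama', 'Sci-Fi']
print(dummies['Drama'].tolist())  # [0, 1, 1]
```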

Add profit column and ROI

In order to gain more insights when analyzing the data set, I need to create two extra columns calculated from existing columns. The first calculated column is profit, which I calculate as the value of the gross column minus the value of the budget column.

In [36]:
# create new columns
df_new['profit'] = (df_new['gross'] - df_new['budget'])
df_new.head()
Out[36]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes profit
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 3.0 237000000.0 2009.0 7.9 33000 523505847.0
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 3.0 300000000.0 2007.0 7.1 0 9404152.0
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 3.0 245000000.0 2015.0 6.8 85000 -44925825.0
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 3.0 250000000.0 2012.0 8.5 164000 198130642.0
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 3.0 263700000.0 2012.0 6.6 24000 -190641321.0

The second calculated column is return on investment (ROI), which I calculate as the ratio of profit to budget.

In [37]:
df_new['roi'] = df_new['profit'] / df_new['budget']
df_new.head()
Out[37]:
color director_name num_critic_for_reviews duration director_facebook_likes gross movie_title num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes profit roi
0 1.0 James Cameron 723.0 178.0 0.0 760505847.0 Avatar 886204 4834 3054.0 1.0 1 3.0 237000000.0 2009.0 7.9 33000 523505847.0 2.208885
1 1.0 Gore Verbinski 302.0 169.0 563.0 309404152.0 Pirates of the Caribbean: At World's End 471220 48350 1238.0 1.0 1 3.0 300000000.0 2007.0 7.1 0 9404152.0 0.031347
2 1.0 Sam Mendes 602.0 148.0 0.0 200074175.0 Spectre 275868 11700 994.0 1.0 0 3.0 245000000.0 2015.0 6.8 85000 -44925825.0 -0.183371
3 1.0 Christopher Nolan 813.0 164.0 22000.0 448130642.0 The Dark Knight Rises 1144337 106759 2701.0 1.0 1 3.0 250000000.0 2012.0 8.5 164000 198130642.0 0.792523
5 1.0 Andrew Stanton 462.0 132.0 475.0 73058679.0 John Carter 212204 1873 738.0 1.0 1 3.0 263700000.0 2012.0 6.6 24000 -190641321.0 -0.722948
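As a quick sanity check of both formulas, the Avatar row above can be reproduced by hand (gross and budget figures taken from the table):

```python
# profit = gross - budget; roi = profit / budget
gross, budget = 760505847.0, 237000000.0

profit = gross - budget
roi = profit / budget

print(profit)         # 523505847.0
print(round(roi, 6))  # 2.208885
```

A negative ROI means the movie grossed less than its budget, as with Spectre and John Carter in the table above.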

4. Data Visualization

Now that the data has been prepared, I will create visualizations to better interpret the trends, patterns, and relationships within the data set, as well as to extract insights. The questions I will answer:

  • amount of movies released by year
  • top 10 performing movies by IMDB score and gross sales
  • top 20 performing directors by IMDB score and gross sales
  • 20 movies with the greatest losses
  • trends between content rating and IMDB score and gross sales
  • changes in IMDB score through time
  • density of rating scores by rating level and IMDB score
  • distribution of data between rating levels

Movies released by year

For this business problem, I want to see a general representation of how the number of movies released has changed over time. To demonstrate this, I will make a histogram displaying the total movies released by year.

Based on this visualization, I can conclude that movie releases grew roughly exponentially through the year 2000. However, after the early 2000s the number begins to decline. This could be caused by a variety of factors, but a likely one is the recession that affected the American economy around 2008 to 2012.

In [38]:
#movies released per year
bins = 30
plt.hist(df_new['title_year'], bins, color='blue', label= 'Movies released by year')
plt.xlabel('Year')
plt.title('Movies released');

Top 10 performing movies by IMDB score and gross sales

For this business problem, I want to get an idea of the types of movies that performed the highest in the data set. To demonstrate this, I will create a scatter plot displaying the movie titles by their gross sales and IMDB score. Since we assume in this project that IMDB score is an adequate portrayal of a movie's success, I will sort by these values first and then by top gross sales.

Based on this visualization, I can conclude that the top 10 movies from the data set are:

  • The Shawshank Redemption
  • The Godfather
  • The Dark Knight
  • The Godfather: Part II
  • Lord of the Rings: Return of the King
  • Pulp Fiction
  • Schindler's List
  • The Good, the Bad and the Ugly
  • The Lord of the Rings: The Fellowship of the Ring
  • Inception

Clearly, several of these movies belong to a series, or are a sequel appearing alongside its original, which shows that some story lines have more popularity than others.

In [39]:
### top 10 performing movies by imdb score, code based on https://plot.ly/python/basic-charts/
top10mv = df_new[['movie_title', 'director_name','imdb_score', 'gross']]
top10mv = top10mv.sort_values('imdb_score',ascending=False).head(10)
px.scatter(top10mv, x="imdb_score", y="gross", text='movie_title', hover_name='movie_title', hover_data=['director_name','gross', 'imdb_score'],  color="director_name", size='gross')

Top performing directors

For this business problem, I want to get an idea of the movie directors that performed the highest in the data set. To demonstrate this, I will create a scatter plot displaying the directors by their gross sales and IMDB score. Since we assume in this project that IMDB score is an adequate portrayal of a movie's success, I will sort by these values first and then by top gross sales.

Based on this visualization, I can conclude that the directors of the top 20 movies from the data set are:

  • Frank Darabont
  • Francis Ford Coppola x2
  • Christopher Nolan x2
  • Peter Jackson x3
  • Quentin Tarantino
  • Steven Spielberg
  • Sergio Leone
  • Robert Zemeckis
  • David Fincher
  • Irvin Kershner
  • Akira Kurosawa
  • Milos Forman
  • Fernando Meirelles
  • Lana Wachowski
  • George Lucas
  • Martin Scorsese

The directors listed are prominent in the industry, and anyone familiar with movies would recognize some of these notable names. A couple of interesting details:

  • Only one of the top 20 movies was directed by a woman
  • Some of the top scoring movies didn't have the highest gross sales
  • A few directors are responsible for more than one of the top movies
In [40]:
### top performing directors by imdb score, code based on https://plot.ly/python/basic-charts/
top20dc = df_new[['movie_title', 'director_name','imdb_score', 'gross']]
top20dc = top20dc.sort_values('imdb_score',ascending=False).head(20)
px.scatter(top20dc, x="imdb_score", y="gross", text='director_name', hover_name='movie_title', hover_data=['director_name','gross', 'imdb_score'],  color="director_name", size='gross')

What movies experienced the greatest losses?

For this business problem, I want to get an idea of the worst performing movies in the data set specifically in relation to the greatest losses. To demonstrate this, I will create a scatter plot displaying the movie titles by their gross sales and IMDB score. For this question, we will sort the subset of data by lowest profits (losses) and then by IMDB score.

Based on this visualization, I can conclude that some of the worst performing movies from the data set are:

  • The Host
  • Lady Vengeance
  • Fateless
  • Princess Mononoke
  • Steamboy
  • Akira
  • Tango
  • Red Cliff
  • Godzilla 2000
  • Kabhi Alvida Naa Kehna
  • Kites
  • Ong-bak
  • The Legend of Suriyothai
  • The Polar Express
  • John Carter
  • The Messenger: The Story of Joan of Arc
  • Battleship
  • The Protector
  • Paa
  • Winged Migration

Most of the movies listed aren't well known, which is what one would expect of the biggest money-losers. However, a few fairly well-known movies with some of the lowest profits stood out to me:

  • The Polar Express is on this list, despite being a frequently played holiday movie.
  • John Carter is another recent movie that was promoted heavily but fell short of expectations.
  • Godzilla 2000 didn't perform as well as the other Godzilla movies.
  • The Host has the greatest losses of any movie on this list, and is quite the outlier.
In [41]:
### 20 movies with the greatest losses, code based on https://plot.ly/python/basic-charts/
last20mv = df_new[['movie_title', 'director_name','imdb_score', 'profit']]
last20mv = last20mv.sort_values('profit',ascending=False).tail(20)
px.scatter(last20mv, x="imdb_score", y="profit", text='movie_title', hover_name='movie_title', hover_data=['director_name','profit', 'imdb_score'], size='imdb_score')

IMDB score by content rating

For this business problem, I want to get an idea of how IMDB score varies among the different content rating levels. To demonstrate this, I will create a scatter plot displaying the data points by their gross sales and IMDB score for each rating level. For this question, I will sort the subset of data by IMDB score first.

Based on this visualization, I can draw some conclusions about the data set:

  • As IMDB score increases, the data points cluster more densely, with most movies falling in the mid-to-high score range.
  • Lower rated movies tend to have higher gross sales, likely due to the larger population that can attend the showings.
  • Some of the highest scoring movies, however, have higher rating levels.
In [42]:
### top movies at each rating level by imdb score and gross, code based on https://plot.ly/python/basic-charts/
top_cr = df_new[['movie_title','imdb_score','gross','content_rating']]
top_cr = top_cr.sort_values('imdb_score',ascending=False)
px.scatter(top_cr, x="imdb_score", y="gross", hover_name='movie_title', hover_data=['movie_title','gross','imdb_score'], color='content_rating') 

IMDB Score through time

For this business problem, I want to get an idea of how IMDB score varies over time. To demonstrate this, I will create a scatter plot displaying the data points by year and IMDB score. For this question, I will sort the subset of data by release year first.

Based on this visualization, I can draw some conclusions about the data set:

  • Older movies tend to have higher average scores than newer movies, especially since the volume of movies produced over time has increased. This raises a new question: are the older movies fundamentally better than today's productions despite the drastic improvements in technology, or are they merely ranked higher because of nostalgia?
  • Around 1980, the number of movies released began growing rapidly, but so did the number of poorly rated movies.
  • In the past few years, there have been fewer poorly rated movies.
In [43]:
#imdb scores through time, code based on https://plot.ly/python/basic-charts/
yr_prof = df_new[['movie_title','imdb_score','gross','title_year']]
yr_prof = yr_prof.sort_values('title_year')
px.scatter(yr_prof, x="title_year", y="imdb_score", hover_name='movie_title', hover_data=['movie_title','gross','imdb_score'], color='imdb_score')

Density of movies' IMDB scores by rating level

In this visualization, I am comparing the density plots of IMDB scores at the different movie rating levels. Based on my findings:

  • There is little variation between the different rating levels. Some have slightly lower average IMDB scores whereas others have higher averages; the levels also differ slightly in the variance of their IMDB scores.
  • The visualizations that follow show the count of movies at each level, which demonstrates that there are far more R rated movies released than movies in the other categories. There is also a trend of an increasing number of movies released at each level up to R.
In [44]:
#density plots of content ratings by IMDB scores
sns.kdeplot(df_new[df_new['content_rating'] == 1]['imdb_score'], label='G')
sns.kdeplot(df_new[df_new['content_rating'] == 2]['imdb_score'], label='PG')
sns.kdeplot(df_new[df_new['content_rating'] == 3]['imdb_score'], label='PG-13')
sns.kdeplot(df_new[df_new['content_rating'] == 4]['imdb_score'], label='R')
sns.kdeplot(df_new[df_new['content_rating'] == 5]['imdb_score'], label='NC-17')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c20cfd780>
In [45]:
#count of each content rating level
rating = df_new.groupby('content_rating').size().reset_index()
rating = rating.replace({1: 'G', 2: 'PG', 3: 'PG-13', 4: 'R', 5: 'NC-17'}).rename(columns={0:'Count'})
rating
Out[45]:
content_rating Count
0 G 91
1 PG 576
2 PG-13 1314
3 R 1860
4 NC-17 16
In [46]:
rating.plot(kind='barh').set_yticklabels(rating.content_rating);

5. Correlation Analysis

I will be doing a correlation analysis to get a better understanding of the factors that are correlated with a successful movie, where a successful movie is defined as one having a high IMDB score. I will also look at which genres have the highest correlation with IMDB score. In the steps below, I gather summary statistics about the data set and then compute the correlations between the variables.

In [47]:
# data set basic statistics
df_new.describe()
Out[47]:
color num_critic_for_reviews duration director_facebook_likes gross num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes profit roi
count 3857.000000 3857.000000 3857.000000 3857.000000 3.857000e+03 3.857000e+03 3857.000000 3857.000000 3857.000000 3857.000000 3857.000000 3.857000e+03 3857.000000 3857.000000 3857.000000 3.857000e+03 3857.000000
mean 0.966295 162.894450 109.901193 783.721027 5.091264e+07 1.023181e+05 11227.824734 326.388644 0.953332 0.789992 3.294011 4.520189e+07 2003.068188 6.463806 9081.565725 5.710748e+06 5.247754
std 0.180492 123.937595 22.737220 3025.924047 6.930377e+07 1.502522e+05 18916.591384 407.839385 0.210955 0.407367 0.810107 2.233096e+08 10.005510 1.053697 21267.885011 2.270307e+08 130.169327
min 0.000000 1.000000 34.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 1.000000 0.000000 0.000000 1.000000 2.180000e+02 1920.000000 1.600000 0.000000 -1.221330e+10 -0.999982
25% 1.000000 72.000000 95.000000 10.000000 6.754898e+06 1.726100e+04 1815.000000 102.000000 1.000000 1.000000 3.000000 1.000000e+07 1999.000000 5.900000 0.000000 -1.022140e+07 -0.549085
50% 1.000000 134.000000 106.000000 58.000000 2.782987e+07 5.038900e+04 3871.000000 202.000000 1.000000 1.000000 3.000000 2.400000e+07 2005.000000 6.600000 206.000000 8.347120e+05 0.069895
75% 1.000000 221.000000 120.000000 222.000000 6.545231e+07 1.239400e+05 15944.000000 390.000000 1.000000 1.000000 4.000000 5.000000e+07 2010.000000 7.200000 11000.000000 2.465179e+07 1.223810
max 1.000000 813.000000 330.000000 23000.000000 7.605058e+08 1.689764e+06 656730.000000 5060.000000 1.000000 1.000000 5.000000 1.221550e+10 2016.000000 9.300000 349000.000000 5.235058e+08 7193.485533
In [48]:
#correlation analysis
df_new.corr()
Out[48]:
color num_critic_for_reviews duration director_facebook_likes gross num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year imdb_score movie_facebook_likes profit roi
color 1.000000 0.006066 -0.046753 -0.056674 0.044109 -0.031591 0.031287 -0.063811 0.054033 0.044791 -0.058137 0.015836 0.166992 -0.114959 0.025366 -0.002112 -0.022378
num_critic_for_reviews 0.006066 1.000000 0.238270 0.179825 0.473935 0.599609 0.241401 0.572495 0.043733 -0.000465 0.052116 0.106876 0.392509 0.342064 0.701890 0.039550 0.033048
duration -0.046753 0.238270 1.000000 0.180540 0.249862 0.343514 0.122161 0.356641 -0.064707 -0.047456 0.082110 0.068946 -0.129491 0.359267 0.219544 0.008458 -0.032463
director_facebook_likes -0.056674 0.179825 0.180540 1.000000 0.143569 0.304951 0.122589 0.222208 0.027777 0.078042 0.009997 0.019171 -0.046758 0.190068 0.163245 0.024969 -0.006243
gross 0.044109 0.473935 0.249862 0.143569 1.000000 0.629347 0.233504 0.553589 0.143473 0.172968 -0.282101 0.101033 0.044862 0.212234 0.367425 0.205884 0.018778
num_voted_users -0.031591 0.599609 0.343514 0.304951 0.629347 1.000000 0.248950 0.782798 0.063371 0.066036 -0.002847 0.068307 0.016726 0.473321 0.519948 0.124928 0.010534
cast_total_facebook_likes 0.031287 0.241401 0.122161 0.122589 0.233504 0.248950 1.000000 0.183218 0.110945 0.102561 -0.016254 0.029198 0.119774 0.103131 0.201246 0.042561 -0.018529
num_user_for_reviews -0.063811 0.572495 0.356641 0.222208 0.553589 0.782798 0.183218 1.000000 0.064171 0.031479 0.033301 0.073241 0.012605 0.320049 0.375155 0.096949 0.068372
language 0.054033 0.043733 -0.064707 0.027777 0.143473 0.063371 0.110945 0.064171 1.000000 0.380840 -0.098756 -0.127252 -0.017168 -0.172338 0.035993 0.168963 0.004966
country 0.044791 -0.000465 -0.047456 0.078042 0.172968 0.066036 0.102561 0.031479 0.380840 1.000000 -0.072966 -0.046009 -0.012901 -0.139454 0.015870 0.098055 0.017756
content_rating -0.058137 0.052116 0.082110 0.009997 -0.282101 -0.002847 -0.016254 0.033301 -0.098756 -0.072966 1.000000 -0.036175 -0.006953 0.127250 0.006193 -0.050532 0.022638
budget 0.015836 0.106876 0.068946 0.019171 0.101033 0.068307 0.029198 0.073241 -0.127252 -0.046009 -0.036175 1.000000 0.044245 0.029407 0.053296 -0.952768 -0.007966
title_year 0.166992 0.392509 -0.129491 -0.046758 0.044862 0.016726 0.119774 0.012605 -0.017168 -0.012901 -0.006953 0.044245 1.000000 -0.127715 0.293981 -0.029825 -0.016247
imdb_score -0.114959 0.342064 0.359267 0.190068 0.212234 0.473321 0.103131 0.320049 -0.172338 -0.139454 0.127250 0.029407 -0.127715 1.000000 0.280089 0.035862 0.009893
movie_facebook_likes 0.025366 0.701890 0.219544 0.163245 0.367425 0.519948 0.201246 0.375155 0.035993 0.015870 0.006193 0.053296 0.293981 0.280089 1.000000 0.059739 -0.002804
profit -0.002112 0.039550 0.008458 0.024969 0.205884 0.124928 0.042561 0.096949 0.168963 0.098055 -0.050532 -0.952768 -0.029825 0.035862 0.059739 1.000000 0.013568
roi -0.022378 0.033048 -0.032463 -0.006243 0.018778 0.010534 -0.018529 0.068372 0.004966 0.017756 0.022638 -0.007966 -0.016247 0.009893 -0.002804 0.013568 1.000000
In [49]:
# show correlation plot
plt.figure(figsize=(8,8))
sns.heatmap(df_new.corr(), vmax=.8, square=True, annot=True, fmt=".2f");
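Rather than scanning the full matrix for the imdb_score row, that single column can be pulled out and sorted; a sketch on toy data (the same pattern applies to df_new.corr()):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'imdb_score':      [8.5, 6.6, 7.1, 5.2, 7.9],
    'duration':        [164, 132, 169, 95, 178],
    'num_voted_users': [1144337, 212204, 471220, 17261, 886204],
})

# correlations with the target, strongest first (the target itself dropped)
corr_with_score = (df_demo.corr()['imdb_score']
                   .drop('imdb_score')
                   .sort_values(ascending=False))
print(corr_with_score.index.tolist())  # ['num_voted_users', 'duration']
```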

Based on this analysis, it is clear that there are some strong correlations between well-performing movies based on IMDB score and variables from the data set.

Variables with a positive relationship with IMDB score:

  • number of votes from users
  • movie duration
  • gross sales
  • number of critic reviews
  • likes on the movie's Facebook page

This means that movies with higher performance scores likely also have a longer duration and higher gross sales. Movies that perform well also tend to have more votes from users, more critic reviews, and more likes on the movie's Facebook page.

In [50]:
# show correlation plot
plt.figure(figsize=(12,12))
sns.heatmap(df1.corr(), vmax=.8, square=True, annot=True, fmt=".2f");

I also created a separate data set to analyze the correlation between movie genres and IMDB scores. From this analysis, it is clear that some genres are more closely correlated with IMDB scores than others.

Positive relationship with IMDB score:

  • Drama: 0.31
  • Biography: 0.17

Negative relationship with IMDB score:

  • Comedy: -0.21
  • Horror: -0.17

While these correlations aren't especially strong, they do show that movies in the Drama and Biography genres tend to have higher IMDB scores, while movies in the Comedy and Horror genres tend to have slightly lower IMDB scores.

6. Findings Summary

Based upon the findings of this report, I have been able to dig deeper into the performance of movies over time as listed on IMDB. The insights gained from this analysis range from unexpected to widely suspected.

Time-based insights:

  • As time has gone on, movies in this data set have drastically increased their sales volumes. However, being a newer movie doesn't guarantee success; a variety of other factors play into this, such as IMDB score, profit, and ROI.
  • The number of movies released has increased over time in this data set. The visualizations I created helped show exponential growth that could have resulted from multiple factors: technological advancements, the increasing popularity of attending movies in theaters, economic growth, and possibly even GDP per capita growth.

IMDB score-based insights:

  • For the movies in this data set, a high score doesn't mean the movie was the most profitable or had the best returns. Some of the most critically acclaimed films had lower budgets than others, used them more effectively, and received greater recognition and success.
  • Scores have on average decreased over time for movies in the data set. As the volume of movies released grows, reviewers must become more critical with their ratings to help distinguish successful movies from unsuccessful ones.

Movie rating insights:

  • For movies in this data set, it is common for more critically acclaimed and successful movies to have higher rating levels than other movies. However, movies with greater profits and gross sales are typically at lower rating levels, as they can be seen by a wider audience.

Correlation insights:

  • Movie duration, gross sales, and the number of reviews and votes from critics and users had the largest correlations with IMDB score, which was used to benchmark a successful movie.

Data preparation insights:

  • When cleaning and preparing the data, a decent portion of the records had missing values. I would recommend that IMDB look back through its records and try to fill these gaps to make the data set more comprehensive for future users.

7. Regression

Steps:

  • Regression feature selection
  • Stats model significance
  • Lasso model
  • Feature selection: KBest
  • Random forest regressor

I will start by removing two categorical columns that will not be helpful in the following regressions, classifications, and clustering steps.

In [51]:
#remove unnecessary columns
df_reg = df_new.drop(['movie_title','director_name'],axis = 1)
In [52]:
#assigning columns to X and Y variables
y = df_reg['imdb_score']
X = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','title_year','movie_facebook_likes','roi']]

To better inform which variables I select for my models and to better understand feature importances, I create a linear regression model and apply Recursive Feature Elimination (RFE).

In [53]:
#recursive feature selection for regression
lr = lm.LinearRegression()
rfe = RFE(lr, n_features_to_select=3)
rfe_y = rfe.fit(X,y)

print("Features sorted by their rank:")
print(sorted(zip([x for x in rfe.ranking_], X.columns)))
Features sorted by their rank:
[(1, 'color'), (1, 'country'), (1, 'language'), (2, 'content_rating'), (3, 'duration'), (4, 'title_year'), (5, 'num_critic_for_reviews'), (6, 'num_user_for_reviews'), (7, 'director_facebook_likes'), (8, 'roi'), (9, 'num_voted_users'), (10, 'movie_facebook_likes'), (11, 'cast_total_facebook_likes')]
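RFE works by fitting the model, discarding the weakest feature(s) by coefficient magnitude, and refitting until only n_features_to_select remain; a self-contained sketch on synthetic data (scikit-learn assumed, as in the cell above):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# 5 features, only the first 2 actually drive y (shuffle=False keeps them first)
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       shuffle=False, random_state=0)

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(list(rfe.ranking_))  # the two informative features are ranked 1
```

Selected features get rank 1; the rest are numbered in the reverse order of their elimination, which is how the sorted ranking list above was produced.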

I create four models: one regression with all variables to act as a benchmark, and three others that use subsets of variables informed by the RFE results.

In [54]:
#First Model
runs_reg_model1 = ols("imdb_score~num_voted_users+duration+director_facebook_likes+title_year",df_reg)
runs_reg1 = runs_reg_model1.fit()
#Second Model
runs_reg_model2 = ols("imdb_score~num_voted_users+duration+title_year",df_reg)
runs_reg2 = runs_reg_model2.fit()
#Third Model
runs_reg_model3 = ols("imdb_score~num_voted_users+duration+director_facebook_likes",df_reg)
runs_reg3 = runs_reg_model3.fit()
#Full model
runs_reg_model = ols("imdb_score~color+num_critic_for_reviews+duration+director_facebook_likes+num_voted_users+cast_total_facebook_likes+num_user_for_reviews+language+country+content_rating+title_year+movie_facebook_likes+roi",df_reg)
runs_reg = runs_reg_model.fit()
In [55]:
#view model results
print(runs_reg.summary())
print(runs_reg1.summary())
print(runs_reg2.summary())
print(runs_reg3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.373
Model:                            OLS   Adj. R-squared:                  0.371
Method:                 Least Squares   F-statistic:                     176.2
Date:                Sat, 29 Feb 2020   Prob (F-statistic):               0.00
Time:                        18:10:16   Log-Likelihood:                -4772.5
No. Observations:                3857   AIC:                             9573.
Df Residuals:                    3843   BIC:                             9661.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    45.6579      3.205     14.245      0.000      39.374      51.942
color                        -0.3389      0.076     -4.446      0.000      -0.488      -0.189
num_critic_for_reviews        0.0021      0.000     10.979      0.000       0.002       0.002
duration                      0.0083      0.001     12.544      0.000       0.007       0.010
director_facebook_likes    8.591e-06   4.72e-06      1.822      0.069   -6.55e-07    1.78e-05
num_voted_users            3.427e-06   1.64e-07     20.924      0.000    3.11e-06    3.75e-06
cast_total_facebook_likes  1.127e-07    7.5e-07      0.150      0.881   -1.36e-06    1.58e-06
num_user_for_reviews         -0.0007   5.69e-05    -11.494      0.000      -0.001      -0.001
language                     -0.7037      0.070    -10.076      0.000      -0.841      -0.567
country                      -0.2514      0.036     -6.975      0.000      -0.322      -0.181
content_rating                0.1088      0.017      6.463      0.000       0.076       0.142
title_year                   -0.0198      0.002    -12.395      0.000      -0.023      -0.017
movie_facebook_likes      -1.514e-06    9.3e-07     -1.628      0.104   -3.34e-06    3.09e-07
roi                           0.0001      0.000      1.249      0.212   -7.41e-05       0.000
==============================================================================
Omnibus:                      519.166   Durbin-Watson:                   1.927
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              990.646
Skew:                          -0.851   Prob(JB):                    7.65e-216
Kurtosis:                       4.808   Cond. No.                     4.35e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.35e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.279
Method:                 Least Squares   F-statistic:                     374.4
Date:                Sat, 29 Feb 2020   Prob (F-statistic):          1.12e-272
Time:                        18:10:16   Log-Likelihood:                -5040.8
No. Observations:                3857   AIC:                         1.009e+04
Df Residuals:                    3852   BIC:                         1.012e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  27.6223      2.928      9.432      0.000      21.881      33.364
num_voted_users          2.781e-06   1.06e-07     26.211      0.000    2.57e-06    2.99e-06
duration                    0.0095      0.001     13.845      0.000       0.008       0.011
director_facebook_likes  9.489e-06   5.02e-06      1.890      0.059   -3.57e-07    1.93e-05
title_year                 -0.0112      0.001     -7.709      0.000      -0.014      -0.008
==============================================================================
Omnibus:                      570.795   Durbin-Watson:                   1.883
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1091.495
Skew:                          -0.922   Prob(JB):                    9.66e-238
Kurtosis:                       4.841   Cond. No.                     3.70e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.7e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.279
Model:                            OLS   Adj. R-squared:                  0.279
Method:                 Least Squares   F-statistic:                     497.6
Date:                Sat, 29 Feb 2020   Prob (F-statistic):          2.74e-273
Time:                        18:10:16   Log-Likelihood:                -5042.6
No. Observations:                3857   AIC:                         1.009e+04
Df Residuals:                    3853   BIC:                         1.012e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          27.8495      2.927      9.515      0.000      22.111      33.588
num_voted_users  2.835e-06   1.02e-07     27.695      0.000    2.63e-06    3.04e-06
duration            0.0096      0.001     14.030      0.000       0.008       0.011
title_year         -0.0113      0.001     -7.794      0.000      -0.014      -0.008
==============================================================================
Omnibus:                      571.063   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1090.916
Skew:                          -0.923   Prob(JB):                    1.29e-237
Kurtosis:                       4.838   Cond. No.                     3.69e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.69e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.269
Model:                            OLS   Adj. R-squared:                  0.268
Method:                 Least Squares   F-statistic:                     472.2
Date:                Sat, 29 Feb 2020   Prob (F-statistic):          2.90e-261
Time:                        18:10:16   Log-Likelihood:                -5070.3
No. Observations:                3857   AIC:                         1.015e+04
Df Residuals:                    3853   BIC:                         1.017e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                   5.0550      0.073     68.854      0.000       4.911       5.199
num_voted_users           2.72e-06   1.07e-07     25.514      0.000    2.51e-06    2.93e-06
duration                    0.0102      0.001     14.960      0.000       0.009       0.012
director_facebook_likes  1.115e-05   5.06e-06      2.205      0.027    1.24e-06    2.11e-05
==============================================================================
Omnibus:                      534.065   Durbin-Watson:                   1.857
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              985.588
Skew:                          -0.885   Prob(JB):                    9.60e-215
Kurtosis:                       4.733   Cond. No.                     9.20e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.2e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

From these results, it is evident that there is some multicollinearity, as expected from the correlation plot earlier. Using that plot, I tried to remove at least one variable that was potentially collinear with another variable in the model. This did reduce the accuracy of the model, as expected when removing a variable.
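To emulate that correlation-plot screening programmatically, one option (a sketch, not part of the original notebook; the helper name and the 0.8 threshold are my own choices) is to drop one feature from each highly correlated pair:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X, threshold=0.8):
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```

In the notebook this could be applied to the regression predictors (e.g. `drop_highly_correlated(X)`) before refitting the OLS models.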

One way to resolve the issue of multicollinearity is to compute the variance inflation factor (VIF) for each predictor, identify the variables that are strongly collinear, and remove one from each such pair. Although this technique has not yet been covered in class, I attempted it multiple times but was unsuccessful in coding it, given its difficulty and the lack of information on how to apply it to this situation. To approximate the strategy with the skills I do have, I identified a few variables that are highly correlated and reduced their number in each model where possible to see if I could get a better result.

In [56]:
#create regression model with RFE identified important variables and remove those that might be collinear
y = df_reg['imdb_score'] 
X = df_reg[['duration','num_voted_users','num_user_for_reviews','budget','title_year','gross']]
In [57]:
#fit regression model
sk_model1 = lm.LinearRegression()
sk_model1.fit(X, y)   
model1_y = sk_model1.predict(X)
In [58]:
# The coefficients
print('Coefficients: ', sk_model1.coef_)
# y-intercept
print("y-intercept ", sk_model1.intercept_)
Coefficients:  [ 1.05954687e-02  4.25168943e-06 -4.17153469e-04  9.00727223e-12
 -1.05672330e-02 -2.01842446e-09]
y-intercept  26.26972522499726
In [59]:
coef = ["%.3f" % i for i in sk_model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
Out[59]:
[('duration', '0.011'),
 ('num_voted_users', '0.000'),
 ('num_user_for_reviews', '-0.000'),
 ('budget', '0.000'),
 ('title_year', '-0.011'),
 ('gross', '-0.000')]
In [60]:
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
mean square error:  0.7746368083144852
variance or r-squared:  0.3021220455009458
In [61]:
# stats model output for previous regression model
runs_reg_model3 = ols("imdb_score~duration+num_voted_users+num_user_for_reviews+budget+title_year+gross",df_reg)
runs_reg3 = runs_reg_model3.fit()
print(runs_reg3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             imdb_score   R-squared:                       0.302
Model:                            OLS   Adj. R-squared:                  0.301
Method:                 Least Squares   F-statistic:                     277.8
Date:                Sat, 29 Feb 2020   Prob (F-statistic):          3.20e-296
Time:                        18:10:16   Log-Likelihood:                -4980.4
No. Observations:                3857   AIC:                             9975.
Df Residuals:                    3850   BIC:                         1.002e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               26.2697      2.889      9.094      0.000      20.606      31.933
duration                 0.0106      0.001     15.572      0.000       0.009       0.012
num_voted_users       4.252e-06   1.65e-07     25.826      0.000    3.93e-06    4.57e-06
num_user_for_reviews    -0.0004    5.7e-05     -7.323      0.000      -0.001      -0.000
budget                9.007e-12    6.4e-11      0.141      0.888   -1.16e-10    1.34e-10
title_year              -0.0106      0.001     -7.355      0.000      -0.013      -0.008
gross                -2.018e-09   2.67e-10     -7.572      0.000   -2.54e-09    -1.5e-09
==============================================================================
Omnibus:                      608.902   Durbin-Watson:                   1.902
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1227.065
Skew:                          -0.955   Prob(JB):                    3.52e-267
Kurtosis:                       4.997   Cond. No.                     4.65e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.65e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

While there are still many things to be improved, I did manage to get a better variance score with a smaller number of variables than in previous attempts before accounting for the variables that may be collinear.

In [62]:
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
In [63]:
model1 = lm.LinearRegression()
model1.fit(X_train, y_train)   
pred_y = model1.predict(X_test)
In [64]:
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
Out[64]:
[('duration', '0.012'),
 ('num_voted_users', '0.000'),
 ('num_user_for_reviews', '-0.000'),
 ('budget', '0.000'),
 ('title_year', '-0.011'),
 ('gross', '-0.000')]
In [65]:
print("mean square error: ", mean_squared_error(y_test, pred_y))
print("variance or r-squared: ", explained_variance_score(y_test, pred_y))
mean square error:  0.8288285947554137
variance or r-squared:  0.29194020317121105

As expected, deploying this model would likely not be very successful, due to the problems with multicollinearity.
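A single train/test split can be sensitive to which rows land in the test set; k-fold cross-validation averages over several splits and gives a steadier estimate. The sketch below uses hypothetical stand-in data in place of the notebook's X and y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; in the notebook, X and y would be the
# df_reg predictors and imdb_score used above.
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.randn(200)

# 5-fold cross-validation: fit on 4 folds, score r-squared on the held-out
# fold, and repeat five times for a less split-dependent estimate.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("r2 per fold:", scores)
print("mean r2:", scores.mean())
```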

In [66]:
# high values indicate multicollinearity
print(np.linalg.cond(runs_reg1.model.exog)) #multicollinearity
print(np.linalg.cond(runs_reg2.model.exog))
print(np.linalg.cond(runs_reg3.model.exog))
36953750.66208699
36921773.81727428
46540458171.603096

As shown above, all three models have high condition numbers, indicating that multicollinearity is present. Next I attempt the lasso method, which applies an L1 penalty that shrinks coefficients and can drive some predictors to zero.

In [67]:
# Fit the model

mod1 = lm.Lasso(alpha=0.1)           #higher alpha (penalty parameter), fewer predictors
mod1.fit(X, y)
mod1_y = mod1.predict(X)
In [68]:
#get model info
mod1
Out[68]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
In [69]:
#get coeff and y-int
print('Coefficients: ', mod1.coef_)
print("y-intercept ", mod1.intercept_)
Coefficients:  [ 1.04349344e-02  4.25292550e-06 -4.13326760e-04  7.88982668e-12
 -9.61437432e-03 -2.02522548e-09]
y-intercept  24.37774855941467
In [70]:
# zip values in columns
coef = ["%.3f" % i for i in mod1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
Out[70]:
[('duration', '0.010'),
 ('num_voted_users', '0.000'),
 ('num_user_for_reviews', '-0.000'),
 ('budget', '0.000'),
 ('title_year', '-0.010'),
 ('gross', '-0.000')]
In [71]:
# show MSE and variance
print("mean square error: ", mean_squared_error(y, mod1_y))
print("variance or r-squared: ", explained_variance_score(y, mod1_y))
mean square error:  0.774748530178955
variance or r-squared:  0.30202139416936447

Like the other models, the lasso fit still suffers from multicollinearity, and adjusting the penalty alone is unlikely to change that.
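One caveat worth noting about the lasso fit above: the L1 penalty is scale-sensitive, so predictors measured in very different units (vote counts in the millions versus durations in minutes) receive very different effective penalties. A sketch of standardizing first, on hypothetical stand-in data mimicking that scale gap:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the notebook's X and y: one feature on a huge
# scale (like num_voted_users) and one on a small scale (like duration).
rng = np.random.RandomState(0)
votes = rng.rand(300) * 1e6
duration = rng.rand(300) * 100
X = np.column_stack([votes, duration])
y = 1e-6 * votes + 0.01 * duration + 0.05 * rng.randn(300)

# Standardizing puts every predictor on the same scale before the L1
# penalty is applied, so no feature is penalized merely for its units.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
model.fit(X, y)
print("standardized coefficients:", model.named_steps['lasso'].coef_)
```

On the standardized scale both coefficients come out comparable, whereas an unscaled lasso would report a near-zero raw coefficient for the vote-count feature simply because of its units.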

Next, I make another attempt using SelectKBest feature selection with the f_regression scoring function.

In [72]:
#select only 4 X variables
X_new = SelectKBest(f_regression, k=4).fit_transform(X, y)
X_new
Out[72]:
array([[1.78000000e+02, 8.86204000e+05, 3.05400000e+03, 7.60505847e+08],
       [1.69000000e+02, 4.71220000e+05, 1.23800000e+03, 3.09404152e+08],
       [1.48000000e+02, 2.75868000e+05, 9.94000000e+02, 2.00074175e+08],
       ...,
       [8.10000000e+01, 5.20550000e+04, 1.30000000e+02, 2.04092000e+06],
       [9.50000000e+01, 1.33800000e+03, 1.40000000e+01, 4.58400000e+03],
       [9.00000000e+01, 4.28500000e+03, 8.40000000e+01, 8.52220000e+04]])
In [73]:
# this helps us find out which variables are selected

selector = SelectKBest(f_regression, k=4).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
[0 1 2 5]
In [74]:
# index (column) positions 0, 1, 2 and 5 refer to duration, num_voted_users, num_user_for_reviews, and gross respectively
X.head(2)
Out[74]:
duration num_voted_users num_user_for_reviews budget title_year gross
0 178.0 886204 3054.0 237000000.0 2009.0 760505847.0
1 169.0 471220 1238.0 300000000.0 2007.0 309404152.0
In [75]:
#new model regression
mod2 = lm.LinearRegression()
mod2.fit(X_new, y)
mod2_y = mod2.predict(X_new)

print("mean square error: ", mean_squared_error(y, mod2_y))
print("variance or r-squared: ", explained_variance_score(y, mod2_y))
mean square error:  0.7855308996553768
variance or r-squared:  0.2923074509716077

This model again achieves results similar to the other regression models.

In [76]:
# use f_regression with k = 3 and develop a new regression model
#f_Regression (Feature Selection)

X_new = SelectKBest(f_regression, k=3).fit_transform(X, y)
X_new
Out[76]:
array([[1.78000e+02, 8.86204e+05, 3.05400e+03],
       [1.69000e+02, 4.71220e+05, 1.23800e+03],
       [1.48000e+02, 2.75868e+05, 9.94000e+02],
       ...,
       [8.10000e+01, 5.20550e+04, 1.30000e+02],
       [9.50000e+01, 1.33800e+03, 1.40000e+01],
       [9.00000e+01, 4.28500e+03, 8.40000e+01]])
In [77]:
#regression model 2
mod2 = lm.LinearRegression()
mod2.fit(X_new, y)
mod2_y = mod2.predict(X_new)

print("mean square error: ", mean_squared_error(y, mod2_y))
print("variance or r-squared: ", explained_variance_score(y, mod2_y))
mean square error:  0.7982530667192776
variance or r-squared:  0.280845924960897

After this attempt, I decide it is time to try the Random Forest regressor. While this model is more of a black-box method, it often produces stronger results; the trade-off is that it can be difficult to explain where its feature importances come from.

In [78]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

#assigning columns to X and Y variables
y = df_reg['imdb_score'] 
X = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','budget','title_year','movie_facebook_likes','gross','roi']]


#regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr = RandomForestRegressor()
regr.fit(X, y)
regr_predicted = regr.predict(X)

print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
//anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

mean square error:  0.09410661135597614
variance or r-squared:  0.9152899210908882

As you can see, this produces a much stronger result given its high r-squared, though this is measured on the training data and is therefore likely optimistic. The following is the list of feature importances in this model.

In [79]:
#RFR feature importances
sorted(zip(regr.feature_importances_, X.columns))
Out[79]:
[(0.0008634617351584697, 'color'),
 (0.006392320661751883, 'language'),
 (0.00816033949594711, 'country'),
 (0.02214822532297739, 'content_rating'),
 (0.02738019765685876, 'movie_facebook_likes'),
 (0.03388511171866767, 'director_facebook_likes'),
 (0.03925128638818624, 'roi'),
 (0.04598398503165244, 'title_year'),
 (0.053789708921303074, 'num_critic_for_reviews'),
 (0.05665666145076041, 'cast_total_facebook_likes'),
 (0.05748904483719986, 'gross'),
 (0.07505070080757435, 'num_user_for_reviews'),
 (0.12083618287604064, 'budget'),
 (0.12368793925671227, 'duration'),
 (0.32842483383920956, 'num_voted_users')]

Using these results, I created a condensed model with fewer variables based on the feature importances listed above.

In [80]:
#new random forest regressor smaller model
y = df_reg['imdb_score'] 
X = df_reg[['duration','num_voted_users','num_user_for_reviews','budget','gross']]

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)

print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
mean square error:  0.07795061809696664
variance or r-squared:  0.9298118049096264

Impressively enough, this actually improves on the r-squared from the previous model. I display the feature importances for this model below.

In [81]:
sorted(zip(regr.feature_importances_, X.columns))
Out[81]:
[(0.12748359265614725, 'num_user_for_reviews'),
 (0.14183825658739652, 'gross'),
 (0.16418129417490873, 'duration'),
 (0.1698133992905022, 'budget'),
 (0.39668345729104526, 'num_voted_users')]
In [82]:
feature_importances = pd.DataFrame(regr.feature_importances_, index = X.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)

feature_importances
Out[82]:
importance
num_voted_users 0.396683
budget 0.169813
duration 0.164181
gross 0.141838
num_user_for_reviews 0.127484

Though this model performs significantly better than the others, it raises the question of what it did so differently. Because it is a black-box method, I am not sure it will be the most effective model to deploy and make predictions with in the long run, especially since multicollinearity influences regression models more than it does classification and clustering. I will try these other types of methods, which may provide better insight.
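Since impurity-based forest importances can be hard to interpret, a permutation-based check is one alternative worth noting (a sketch on hypothetical stand-in data; `permutation_importance` assumes scikit-learn >= 0.22, newer than the version producing the warnings above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: feature 0 drives the target, feature 1 is noise.
rng = np.random.RandomState(0)
X = rng.rand(300, 2)
y = 3 * X[:, 0] + 0.1 * rng.randn(300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one column of the held-out data at a time
# and measure how much the test score drops; unlike impurity importances,
# it is computed on unseen data rather than the training set.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("importances:", result.importances_mean)
```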

8. Classification

In this section, I attempt to find a model with stronger accuracy for deployment and prediction on this data set. To start, I create a new column in the data set that bins the IMDB scores into 4 categories: less than 4, 4 to 6, 6 to 8, and 8 to 10; respectively Bad, Okay, Good, and Excellent.

In [83]:
#create bins for IMDB scores
df_reg['imdb_score_bin'] = pd.cut(df_reg['imdb_score'], bins=[0, 4, 6, 8, 10], labels=['0','1','2','3'])
In [84]:
# declare X variables and y variable, and remove both y variables from independent data set
y = df_reg['imdb_score_bin'] 
X = df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)
print(y.shape, X.shape)
(3857,) (3857, 16)

In the next steps, I use the decision tree classifier as a model, train it, and evaluate how well it predicts on the held-out test data.

In [85]:
# split validation
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize DecisionTreeClassifier() ... name your decision model "dt"
dt = DecisionTreeClassifier()
dt
# Train a decision tree model
dt= dt.fit(X_train, y_train)
In [86]:
#size of train and test data sets
print(len(X_train), len(X_test))
2699 1158
In [87]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))

# y-test is the acual y value in the testing dataset
# dt.predict(X_test) is the predicted y value generated by your model
# If they are same, we can say your model is accurate.
0.6658031088082902
--------------------------------------------------------
[[  6  19   6   0]
 [ 16 165 135   0]
 [  7 165 571  13]
 [  0   1  25  29]]
--------------------------------------------------------
              precision    recall  f1-score   support

           0       0.21      0.19      0.20        31
           1       0.47      0.52      0.50       316
           2       0.77      0.76      0.76       756
           3       0.69      0.53      0.60        55

    accuracy                           0.67      1158
   macro avg       0.54      0.50      0.51      1158
weighted avg       0.67      0.67      0.67      1158

In [88]:
import scikitplot as skplt

skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()

This decision tree does significantly better than the previous regression models, with an accuracy of about 67%. Now I will try the K Nearest Neighbors model to classify the data.

In [89]:
# evaluate the model by splitting into train and test sets & develop knn model

# split validation
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize knn()
knn = KNeighborsClassifier()

# Train a decision tree model
knn= knn.fit(X_train, y_train)

knn
Out[89]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
In [90]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
0.5958549222797928
--------------------------------------------------------
[[  0   6  25   0]
 [  0  86 230   0]
 [  3 146 604   3]
 [  0   4  51   0]]
--------------------------------------------------------
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        31
           1       0.36      0.27      0.31       316
           2       0.66      0.80      0.73       756
           3       0.00      0.00      0.00        55

    accuracy                           0.60      1158
   macro avg       0.25      0.27      0.26      1158
weighted avg       0.53      0.60      0.56      1158

The result of this model is slightly lower, so I will use GridSearchCV to tune it.

In [91]:
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}

#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5, iid=False)

#fit model to training data
knn_gs.fit(X_train, y_train)
Out[91]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid=False, n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [92]:
#save best model
knn_best = knn_gs.best_estimator_

#check best n_neigbors value
print(knn_gs.best_score_)
print(knn_gs.best_params_)
print(knn_gs.best_estimator_)
0.6528337799766372
{'n_neighbors': 23}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=23, p=2,
                     weights='uniform')

Implementing the best model from KNN improved the accuracy to nearly match the first decision tree model. Since that initial decision tree model is a full model, I'd like to make a simpler tree.

In [93]:
# A simpler decision tree(max_depth=3, min_samples_leaf=5)
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize DecisionTreeClassifier()
dt_simple = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
dt_simple
# Train a decision tree model
dt_simple = dt_simple.fit(X_train, y_train)

dt_simple
Out[93]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [94]:
# Find out the performance of this model & interpret the results

#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt_simple.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt_simple.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt_simple.predict(X_test)))
0.6925734024179621
--------------------------------------------------------
[[  0  17  14   0]
 [  0 156 160   0]
 [  0 119 629   8]
 [  0   0  38  17]]
--------------------------------------------------------
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        31
           1       0.53      0.49      0.51       316
           2       0.75      0.83      0.79       756
           3       0.68      0.31      0.42        55

    accuracy                           0.69      1158
   macro avg       0.49      0.41      0.43      1158
weighted avg       0.67      0.69      0.67      1158

//anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

This simpler decision tree has an even higher accuracy than the KNN model and the full decision tree model. I display the new decision tree itself below.

In [95]:
# decision tree visual
from graphviz import Source
from sklearn import tree
Source( tree.export_graphviz(dt_simple, out_file=None, feature_names=X.columns))
Out[95]:
[Decision tree visualization: the root splits on num_voted_users <= 85651, with further splits on duration, gross, budget, title_year, and movie_facebook_likes.]

As in the regression section, I will now use the Random Forest Classifier to see how well this model performs. Since this is a classification task, the multicollinearity issue that affected the regression models should not have the same impact.

In [96]:
# declare X variables and y variable

y = df_reg['imdb_score_bin'] 
X = df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)

print(y.shape, X.shape)
(3857,) (3857, 16)
In [97]:
# split validation
#validation: more accurate way to validate model for deployment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=20)    #building 20 decision trees
clf=clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[97]:
0.7599309153713298
In [98]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, clf.predict(X_test))) #overall accuracy
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
0.7599309153713298
[[  0  22   9   0]
 [  0 186 130   0]
 [  1  82 668   5]
 [  0   0  29  26]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        31
           1       0.64      0.59      0.61       316
           2       0.80      0.88      0.84       756
           3       0.84      0.47      0.60        55

    accuracy                           0.76      1158
   macro avg       0.57      0.49      0.51      1158
weighted avg       0.74      0.76      0.74      1158

As I expected, the random forest classifier shows improved accuracy over the previous decision trees. Since its accuracy isn't drastically higher, I can be more confident that the improvement wasn't due to multicollinearity problems.
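Since a fitted random forest exposes `feature_importances_`, it can also show which variables drive the classification. A minimal sketch on synthetic stand-in data (the column names here are illustrative; in the notebook the fitted model is `clf` and the feature names are `X.columns`):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the movie features; column names are illustrative.
rng = np.random.RandomState(42)
X_demo = pd.DataFrame(rng.rand(200, 3),
                      columns=['num_voted_users', 'duration', 'budget'])
# The target depends only on the first column, so it should dominate.
y_demo = (X_demo['num_voted_users'] > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# feature_importances_ ranks variables by mean impurity decrease (sums to 1).
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```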

I then deploy this model to predict the IMDB score bins for the data set.

In [99]:
score=df_reg.drop(['imdb_score_bin','imdb_score'], axis =1)
score.head(2)
Out[99]:
color num_critic_for_reviews duration director_facebook_likes gross num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year movie_facebook_likes profit roi
0 1.0 723.0 178.0 0.0 760505847.0 886204 4834 3054.0 1.0 1 3.0 237000000.0 2009.0 33000 523505847.0 2.208885
1 1.0 302.0 169.0 563.0 309404152.0 471220 48350 1238.0 1.0 1 3.0 300000000.0 2007.0 0 9404152.0 0.031347
In [100]:
# deploy your model for real world

predictedY = clf.predict(score)
print(predictedY)
['2' '2' '2' ... '2' '2' '2']
In [101]:
#combine the predicted Y value with the scoring dataset
predictedY = pd.DataFrame(predictedY, columns=['predicted Y'])
predictedY.head()
Out[101]:
predicted Y
0 2
1 2
2 2
3 3
4 2
In [102]:
#join the predicted and initial data set
data1 = score.join(predictedY) 
data1.head()
Out[102]:
color num_critic_for_reviews duration director_facebook_likes gross num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating budget title_year movie_facebook_likes profit roi predicted Y
0 1.0 723.0 178.0 0.0 760505847.0 886204 4834 3054.0 1.0 1 3.0 237000000.0 2009.0 33000 523505847.0 2.208885 2
1 1.0 302.0 169.0 563.0 309404152.0 471220 48350 1238.0 1.0 1 3.0 300000000.0 2007.0 0 9404152.0 0.031347 2
2 1.0 602.0 148.0 0.0 200074175.0 275868 11700 994.0 1.0 0 3.0 245000000.0 2015.0 85000 -44925825.0 -0.183371 2
3 1.0 813.0 164.0 22000.0 448130642.0 1144337 106759 2701.0 1.0 1 3.0 250000000.0 2012.0 164000 198130642.0 0.792523 3
5 1.0 462.0 132.0 475.0 73058679.0 212204 1873 738.0 1.0 1 3.0 263700000.0 2012.0 24000 -190641321.0 -0.722948 2
In [103]:
#number of movies in each IMDB bin
data1.groupby('predicted Y').size()
Out[103]:
predicted Y
0      58
1     900
2    2203
3     120
dtype: int64
In [104]:
#percent of movies in each bin
data1.groupby('predicted Y').size()/len(data1)
Out[104]:
predicted Y
0    0.015038
1    0.233342
2    0.571169
3    0.031112
dtype: float64
In [105]:
#visualization of number of movies in each bin (classification)
data1.groupby('predicted Y').size().plot(kind='bar');

The predictions above reflect the distribution of real scores in the data set. A large number of movies are classified as good, followed by movies classified as okay. A relatively small group is classified as excellent, so that distinction remains quite selective.
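To double-check that the forest's edge over a single tree holds up beyond one train/test split, cross-validation can average accuracy over several splits. A minimal sketch on synthetic data (in the notebook this would be run on the real X and y instead of `make_classification`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the movie features and score bins.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

models = {'simple tree': DecisionTreeClassifier(max_depth=3, random_state=0),
          'random forest': RandomForestClassifier(n_estimators=20, random_state=0)}

means = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy averages over five different splits
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    means[name] = scores.mean()
    print(name, round(means[name], 3))
```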

9. Clustering

In this section, I will use clustering models to identify important variables in determining IMDB score and to find clusters that describe the movies in the data set. To prepare the data for clustering models, I run a variance test and then normalize the data before fitting the models.

In [106]:
# variance test
df_cl = df_reg[['color','num_critic_for_reviews','duration','director_facebook_likes','num_voted_users','cast_total_facebook_likes','num_user_for_reviews','language','country','content_rating','title_year','movie_facebook_likes','roi']]
df_cl.var()
Out[106]:
color                        3.257737e-02
num_critic_for_reviews       1.536053e+04
duration                     5.169812e+02
director_facebook_likes      9.156216e+06
num_voted_users              2.257573e+10
cast_total_facebook_likes    3.578374e+08
num_user_for_reviews         1.663330e+05
language                     4.450199e-02
country                      1.659475e-01
content_rating               6.562738e-01
title_year                   1.001102e+02
movie_facebook_likes         4.523229e+08
roi                          1.694405e+04
dtype: float64
In [107]:
#normalize data
normalized_df=(df_cl-df_cl.mean())/df_cl.std()
normalized_df.head()
Out[107]:
color num_critic_for_reviews duration director_facebook_likes num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating title_year movie_facebook_likes roi
0 0.186739 4.519255 2.995037 -0.259002 5.217133 -0.338001 6.687955 0.221224 0.515525 -0.362928 0.592855 1.124627 -0.023346
1 0.186739 1.122384 2.599210 -0.072943 2.455218 1.962414 2.235221 0.221224 0.515525 -0.362928 0.392965 -0.427008 -0.040074
2 0.186739 3.542957 1.675614 -0.259002 1.155057 0.024961 1.636947 0.221224 -1.939266 -0.362928 1.192524 3.569628 -0.041724
3 0.186739 5.245427 2.379306 7.011504 6.935131 5.050126 5.822418 0.221224 0.515525 -0.362928 0.892689 7.284149 -0.034226
5 0.186739 2.413356 0.971922 -0.102025 0.731343 -0.494530 1.009249 0.221224 0.515525 -0.362928 0.892689 0.701454 -0.045869

The Elbow method runs KMeans across a range of cluster counts to help determine how many clusters, or profiles, should be used when creating the models. Based on the plot below, it appears that either 4 or 5 clusters would be best.

In [108]:
#The Elbow method

#https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn#

#http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
#Computes distance between each pair of the two collections of inputs
from scipy.spatial.distance import cdist 

K = range(1, 15) 

meandistortions = []

for k in K: 
    kmeans = KMeans(n_clusters=k, random_state=1) 
    kmeans.fit(normalized_df) 
    meandistortions.append(sum(np.min(cdist(normalized_df, kmeans.cluster_centers_, 'euclidean'), axis=1)) / normalized_df.shape[0]) 

plt.plot(K, meandistortions, 'bx-') 
plt.xlabel('k') 
plt.ylabel('Average distortion') 
Out[108]:
Text(0, 0.5, 'Average distortion')

First I make a model with 4 clusters.

In [109]:
# clustering analysis using k-means
from sklearn.cluster import KMeans

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
k_means = KMeans(init='k-means++', n_clusters=4, random_state=0)
In [110]:
#fit data into model
k_means.fit(normalized_df)
Out[110]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
In [111]:
# find out cluster centers
k_means.cluster_centers_
Out[111]:
array([[-5.35367143, -0.03247779,  0.2502995 ,  0.30341213,  0.16912708,
        -0.16750018,  0.34162159, -0.28927461, -0.23979514,  0.31124639,
        -0.89401845, -0.13580373,  0.11980607],
       [ 0.18673928, -0.30174054, -0.16527696, -0.11933357, -0.29203523,
        -0.08607165, -0.26541583,  0.22122445,  0.08406714, -0.04265267,
        -0.05871143, -0.26503366, -0.01758379],
       [ 0.18673928,  1.64327709,  0.72631832,  0.58536508,  1.57014826,
         0.62896613,  1.40610566,  0.21298037,  0.11422026,  0.02778749,
         0.45762472,  1.45449754,  0.07009654],
       [ 0.18673928, -0.23856504,  0.26369876, -0.11866141, -0.31825776,
        -0.50172409, -0.36439072, -4.51912397, -1.73098079,  0.43008299,
         0.17248013, -0.16379436, -0.02034853]])
In [112]:
# convert cluster labels to dataframe
c = pd.DataFrame(k_means.labels_, columns = ['cluster'])
c.head()
Out[112]:
cluster
0 2
1 2
2 2
3 2
4 2
In [113]:
# reset index for cluster set
c = c.reset_index(drop=True)
In [114]:
#verify lengths of both data set are equal before joining
len(normalized_df)
Out[114]:
3857
In [115]:
#verify lengths of both data set are equal before joining
len(c)
Out[115]:
3857
In [116]:
#join normalized data and cluster data
df2 = normalized_df.join(c)
df2.head()
Out[116]:
color num_critic_for_reviews duration director_facebook_likes num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating title_year movie_facebook_likes roi cluster
0 0.186739 4.519255 2.995037 -0.259002 5.217133 -0.338001 6.687955 0.221224 0.515525 -0.362928 0.592855 1.124627 -0.023346 2.0
1 0.186739 1.122384 2.599210 -0.072943 2.455218 1.962414 2.235221 0.221224 0.515525 -0.362928 0.392965 -0.427008 -0.040074 2.0
2 0.186739 3.542957 1.675614 -0.259002 1.155057 0.024961 1.636947 0.221224 -1.939266 -0.362928 1.192524 3.569628 -0.041724 2.0
3 0.186739 5.245427 2.379306 7.011504 6.935131 5.050126 5.822418 0.221224 0.515525 -0.362928 0.892689 7.284149 -0.034226 2.0
5 0.186739 2.413356 0.971922 -0.102025 0.731343 -0.494530 1.009249 0.221224 0.515525 -0.362928 0.892689 0.701454 -0.045869 2.0
In [117]:
#join original data and cluster data
data_cl = df_cl.join(c)
data_cl.head()
Out[117]:
color num_critic_for_reviews duration director_facebook_likes num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating title_year movie_facebook_likes roi cluster
0 1.0 723.0 178.0 0.0 886204 4834 3054.0 1.0 1 3.0 2009.0 33000 2.208885 2.0
1 1.0 302.0 169.0 563.0 471220 48350 1238.0 1.0 1 3.0 2007.0 0 0.031347 2.0
2 1.0 602.0 148.0 0.0 275868 11700 994.0 1.0 0 3.0 2015.0 85000 -0.183371 2.0
3 1.0 813.0 164.0 22000.0 1144337 106759 2701.0 1.0 1 3.0 2012.0 164000 0.792523 2.0
5 1.0 462.0 132.0 475.0 212204 1873 738.0 1.0 1 3.0 2012.0 24000 -0.722948 2.0
In [118]:
#group new data set by cluster and get average value for each variable
data_cl.groupby(['cluster']).mean() 
Out[118]:
color num_critic_for_reviews duration director_facebook_likes num_voted_users cast_total_facebook_likes num_user_for_reviews language country content_rating title_year movie_facebook_likes roi
cluster
0.0 0.971698 159.669811 111.141509 707.896226 93107.518868 10694.830189 283.518868 0.981132 0.773585 3.349057 2001.056604 10484.339623 1.155883
1.0 0.972876 163.621460 110.536099 896.870762 103841.509374 11873.397686 321.554846 0.963303 0.800160 3.240527 2003.498604 8664.121260 1.009161
2.0 0.971857 219.403377 117.332083 806.726079 162720.821764 15212.472795 509.041276 0.979362 0.829268 3.090056 2004.973734 16167.951220 0.592303
3.0 0.977778 140.066667 111.296296 779.851852 87207.896296 7761.192593 306.555556 0.948148 0.711111 3.392593 2000.844444 6281.444444 1.533206
In [119]:
#check how many observations in each cluster
df2.groupby('cluster').size()
Out[119]:
cluster
0.0     106
1.0    2507
2.0     533
3.0     135
dtype: int64

Four profiles appear to be a good fit for the data set. When I tried 5 profiles, only one movie was assigned to the 5th cluster, so I determined that 4 was better for this model.
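The choice of cluster count can also be checked numerically: the silhouette score rewards tight, well-separated clusters and offers a second opinion alongside the elbow plot. A minimal sketch on synthetic blobs (in the notebook the loop would run over `normalized_df` instead):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated groups of points.
rng = np.random.RandomState(0)
data = np.vstack([rng.randn(50, 2) + offset
                  for offset in ([0, 0], [8, 8], [16, 0])])

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(data)
    # silhouette_score lies in [-1, 1]; higher means tighter,
    # better-separated clusters.
    scores[k] = silhouette_score(data, labels)
print(scores)
```

Here the score should peak at the true number of groups, mirroring how an uninformative extra cluster (like the one-movie 5th cluster) would lower the score.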

A description of the Movie Profiles

All Profiles:

  • Unlikely to include black-and-white movies
  • Mostly filmed in English
  • Average title year in the early 2000s

Profile 1:

  • Second lowest number of critic reviews
  • Average duration
  • Lowest number of director facebook likes
  • Second lowest number of voted users
  • Second lowest number of cast facebook likes
  • Lowest number of user reviews
  • Most likely to be filmed in English
  • Second least likely to be produced in the US
  • Second highest average content rating
  • Second oldest title year
  • Second highest number of movie facebook likes
  • Second highest ROI

Profile 2:

  • Second highest number of critic reviews
  • Lowest average duration
  • Highest number of director facebook likes
  • Second highest number of voted users
  • Second highest number of cast facebook likes
  • Second highest number of user reviews
  • Third most likely to be filmed in English
  • Second most likely to be produced in the US
  • Second lowest average content rating
  • Second most recent title year
  • Second lowest number of movie facebook likes
  • Second lowest average ROI

Profile 3:

  • Highest number of critic reviews
  • Highest average duration
  • Second highest number of director facebook likes
  • Highest number of voted users
  • Highest number of cast facebook likes
  • Highest number of user reviews
  • Second most likely to be filmed in English
  • Most likely to be produced in the US
  • Lowest average content rating
  • Most recent average title year
  • Highest number of movie facebook likes
  • Lowest average ROI

Profile 4:

  • Lowest number of critic reviews
  • Average duration
  • Second lowest number of director facebook likes
  • Lowest number of voted users
  • Lowest number of cast facebook likes
  • Second lowest number of user reviews
  • Least likely to be filmed in English (though still very likely)
  • Least likely to be produced in the US
  • Highest average content rating
  • Oldest average title year
  • Lowest number of movie facebook likes
  • Highest average ROI, gaining the most return on production

I then perform hierarchical clustering to show the 4 profiles.

In [120]:
#hierarchical clustering
#agglomerative clustering
np.random.seed(1) # setting random seed to get the same results each time.

agg= AgglomerativeClustering(n_clusters=4, linkage='ward').fit(normalized_df)
agg.labels_
Out[120]:
array([0, 0, 0, ..., 3, 2, 2])
In [121]:
#hierarchical cluster dendrogram
from scipy.cluster.hierarchy import ward, dendrogram
plt.figure(figsize=(16,8))

plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')

linkage_matrix = ward(normalized_df)
dendrogram(linkage_matrix, 
           truncate_mode='lastp',  # show only the last p merged clusters
           p=4,  # show only the last p merged clusters
           #show_leaf_counts=False,  # otherwise numbers in brackets are counts
           leaf_rotation=90.,
           leaf_font_size=12.,
           show_contracted=True,  # to get a distribution impression in truncated branches
           orientation="top")
plt.tight_layout() # fixes margins

This shows how and which clusters would combine if we continually reduced the number of clusters.
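The same linkage matrix can also be cut into a flat cluster assignment without replotting, using `scipy.cluster.hierarchy.fcluster`. A minimal sketch on synthetic data (in the notebook, `linkage_matrix` is already computed from `normalized_df`):

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Synthetic stand-in: two clearly separated groups of points.
rng = np.random.RandomState(0)
data = np.vstack([rng.randn(20, 3), rng.randn(20, 3) + 10])

# ward() builds the same linkage matrix used by the dendrogram;
# fcluster cuts it into at most t flat clusters.
linkage_matrix = ward(data)
labels = fcluster(linkage_matrix, t=2, criterion='maxclust')
print(labels)
```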

10. Storytelling:

Key Findings:

Regression: Multicollinearity was evident in this data set when conducting regression models. Though I tried to adjust for it, the r-squared values remained relatively close between models. The only regression model with a notably higher r-squared was the Random Forest Regressor. Since this model is built on an algorithm that uses decision trees for regression, it is less likely to be affected by multicollinearity and therefore more likely to reach higher accuracy. That said, since the other models were around a 0.3 r-squared and the Random Forest Regressor reached 0.91, I am not sure what this black-box model did so differently from the other regression models such as lasso and RFE. For this data set, one can either assume the Random Forest Regressor is surprisingly accurate, or treat the gap as a sign that the model may be less reliable for this specific data set.

Classification: For classification, I built a full decision tree, a simplified decision tree, a KNN model, and a Random Forest Classifier. Of these, the model with the highest accuracy in predicting the correct IMDB score classification was the Random Forest Classifier, with an accuracy of about 0.76. The accuracy of these models was much higher than that of the regressions, so I think a classification model is better suited than regression to predict IMDB score level. The next best performing model was the simplified decision tree with an accuracy of 0.69. Since this model and the Random Forest Classifier were much closer in performance than the other regression models were to the Random Forest Regressor, it is safer to conclude that the Random Forest Classifier is genuinely the best model.

Clustering: From the clustering analyses, I determined that 4 profiles were best suited for this data set. While the Elbow method showed 5 as potentially better, that variation placed only one movie in the 5th cluster, which did little to improve the model incrementally, so I reduced it back to 4 profiles. A basic overview of the profiles: Profile 1 was generally average in duration and age, yet had some of the lowest numbers of critic and user reviews and the second highest average ROI. Profile 2 was typically shorter films with fairly good numbers of votes and reviews but one of the lower average ROIs. Profile 3 garnered the most consumer and public attention, with the highest numbers of votes, ratings, and likes and the longest durations, yet had the lowest average ROI of the four profiles. Profile 4 films were likely smaller box office films with the lowest numbers of likes, reviews, and votes, meaning they drew less public attention; these movies, however, managed the highest average ROIs of the 4 profiles.

Best classification model: As explained above, the Random Forest Classifier is the most accurate model from the classification section. If the person deploying these models prefers to avoid random forests because their black-box nature doesn't clearly show how variables are chosen and ranked, it would be advisable to select the next best model, the simplified decision tree.

Most important variables: Based on both regression and classification models, some of the most important variables in determining high IMDB scores and movie success were: duration, title year, number of voted users, gross, budget and user reviews. These variables were ever-present in the most successful models like the simple decision tree, Random Forest Regressor and Classifier models.

Recommendations for movie producers: For movie producers, I would recommend a strong marketing campaign, as the movies with the most reviews, votes, and likes tended to have some of the best ratings. I would also note that movies with slightly longer durations and higher budgets usually had higher IMDB scores. Newer movies tend to perform well in IMDB scores as well.

Additional variables needed: Some additional variables that could be beneficial include the time of year of release, especially in reference to peak times or holiday seasons, to see whether a seasonality trend predicts the highest scores. Another variable might be whether, or how many, Oscar nominations a movie received. Regarding publicity, it may be interesting to look at how much was spent on marketing campaigns to see if this is a good predictor of IMDB score, since some of the higher scored movies had a lot of awareness.

References:

All code referenced from another source has the source listed directly in its specific cell.